Implies that a constant change in a predictor leads to a constant change in the response
This is intuitive for a response that has a normal distribution: it can vary effectively indefinitely in either direction, or over a small range (e.g. heights)
This is inappropriate for a wide range of types of response variables common in animal science data sets
Introduction
Animal science data are often discrete or categorical
Binary (dichotomous) variables
Suffering from a particular disease
Obese or normal weight
Count variables
number of disease recordings per herd
number of discrete play behaviours observed
Also continuous, positive (or non-negative) data such as concentrations of particular compounds in blood
Logistic Regression
A binary or dichotomous variable can take only one of two values
Numerically these are coded as 0 and 1 when we fit a logistic regression
a value of 1 is generally known as the success state; usually this indicates cases where the risk factor or outcome of interest is present,
a value of 0 is the failure state; the absence of the risk factor or outcome of interest
Alternatively, we may have data representing a proportion of successes out of some total in a set of \(m\) trials or experiments
Such variables are strictly bounded at 0 and 1
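A binomial GLM can handle such proportion data directly; a minimal sketch in base R, on simulated data (all variable names here are illustrative), passes a two-column matrix of successes and failures to `glm()`:

```r
# Sketch: binomial GLM for proportions of successes out of m trials
# (simulated data; variable names are illustrative)
set.seed(7)
n <- 50
trials <- rep(20, n)                               # m trials per observation
x <- runif(n)
successes <- rbinom(n, size = trials, prob = plogis(-1 + 2 * x))
fit <- glm(cbind(successes, trials - successes) ~ x, family = binomial)
coef(fit)                                          # estimates on the log-odds scale
```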
Logistic Regression
Association between high somatic cell count (SCC) & milk yield in dairy cows
Define high SCC as SCC > 125,000 cells ml\(^{-1}\)
A linear regression line could predict negative probabilities of high SCC
The variance is unlikely to be constant; it is reduced near 0 & 1
\(\displaystyle \frac{p}{1 - p}\) is the odds, a measure that compares the probability of an event happening with the probability that it does not happen
If \(p\) = 0.25 the odds are \(\frac{0.25}{1 - 0.25} = \frac{1}{3}\) or 1 success for every 3 failures
If \(p\) = 0.8 the odds are \(\frac{0.8}{1 - 0.8} = 4\) or 4 successes for every failure
The logit is the natural log of the odds; the odds are a non-negative number, but the log odds can be any real number
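The probability-to-odds-to-logit mapping can be checked in base R, where `qlogis()` computes the logit:

```r
# Probability -> odds -> log-odds; qlogis() is base R's logit function
p <- c(0.25, 0.8)
odds <- p / (1 - p)        # 1/3 and 4, matching the examples above
log_odds <- log(odds)      # the logit: can be any real number
```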
\(\beta_0\) is a constant term; the predicted log-odds of success (high SCC) where \(x\) is 0 (a milk yield of 0)
\(\beta_1\) is the slope; the predicted change in the log-odds of success (high SCC) for a 1-unit increase in \(x\) (an additional kg of milk per day)
Because log-odds can take any value
\(\beta_1 \mathsf{> 0}\); increasing \(x\) is associated with an increase in the probability of success
\(\beta_1 \mathsf{< 0}\); increasing \(x\) is associated with a decrease in the probability of success
\(\beta_1 \mathsf{= 0}\); increasing \(x\) has no effect on the probability of success
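As a quick numerical check with hypothetical coefficient values (not estimates from any real model), base R's `plogis()` maps log-odds back to probabilities:

```r
# Direction of effect: hypothetical coefficients mapped back via inverse logit
beta0 <- 0.82; beta1 <- -0.036     # illustrative values on the log-odds scale
x <- c(10, 30)
eta <- beta0 + beta1 * x           # linear predictor (log-odds)
p <- plogis(eta)                   # inverse logit: back to probabilities
# beta1 < 0, so the probability of success falls as x increases
```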
Logistic Regression
A method known as maximum likelihood (ML) is used to estimate values of \(\beta_0\) and \(\beta_j\) from the data, the maximum likelihood estimates (MLEs)
So called because these estimates are the ones that make the data most likely under the model
In linear regression we minimised the residual sums of squares
With GLMs we minimise the deviance
Can compare likelihoods of two models using a likelihood ratio test
Logistic Regression
m <- glm(high_scc ~ milk_yield_kg, data = scc, family = binomial)
m0 <- glm(high_scc ~ 1, data = scc, family = binomial)
anova(m0, m, test = "LRT")
We see that compared to the null model of no change in risk of high SCC, the milk yield of a cow has a statistically significant effect on the probability of high SCC
Estimate contains the estimates of \(\beta_j\) on the log-odds scale; these are easier to interpret on the odds scale
The odds of high SCC for a cow with zero milk yield are 2.268
\(\beta_1\) is negative, so higher-yielding cows are associated with a decreased risk of high SCC; for each additional kg of yield, the odds of high SCC are multiplied by 0.965
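A sketch of this odds-scale interpretation using `exp(coef())`, on simulated stand-in data (not the real SCC data set; the coefficients used to simulate are made up):

```r
# Sketch: odds-scale interpretation of logistic regression coefficients
# (simulated stand-in data; not the real SCC data set)
set.seed(1)
n <- 500
milk_yield_kg <- runif(n, 5, 50)
high_scc <- rbinom(n, 1, plogis(0.8 - 0.04 * milk_yield_kg))
m_sim <- glm(high_scc ~ milk_yield_kg, family = binomial)
exp(coef(m_sim))   # intercept: odds at zero yield; slope: odds ratio per extra kg
```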
Odds ratio
Odds ratio of high SCC comparing milk_yield_kg of 25 kg with 20 kg
exp(-0.036 * (25 - 20))
[1] 0.8352702
m |>
  comparisons(variables = list(milk_yield_kg = c(20, 25)),
              comparison = "lnoravg", transform = "exp")
Tests of \(H_0: \; \beta_j = 0\) are done using a Wald statistic instead of the \(t\) test
\[z = \frac{\beta_j - 0}{\mathsf{SE}(\beta_j)}\]
Rather than following a \(t\) distribution, Wald statistics are approximately normally distributed; the \(p\) value is the probability of seeing a \(z\) as extreme as that observed under a standard normal distribution
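The Wald calculation is simple to reproduce by hand; the estimate and standard error below are illustrative values in the spirit of the SCC example:

```r
# Wald z computed by hand from an estimate and its standard error
# (illustrative values, not output from a real model)
beta_hat <- -0.036
se_beta  <- 0.012
z <- (beta_hat - 0) / se_beta        # the Wald statistic
p_value <- 2 * pnorm(-abs(z))        # two-sided p under N(0, 1)
```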
Logistic Regression
Fitted functions on the scale of \(\eta\) and the response
lr_p1 <- m |>
  plot_predictions(by = "milk_yield_kg", type = "link") +
  labs(y = "Log odds", x = "Milk yield (kg)")
lr_p2 <- m |>
  plot_predictions(by = "milk_yield_kg", type = "response") +
  labs(y = "Probability of high SCC", x = "Milk yield (kg)")
lr_p1 + lr_p2
Logistic Regression
new_df <- data.frame(milk_yield_kg = seq(3, 100, by = 1))
lr_p1 <- m |>
  plot_predictions(by = "milk_yield_kg", type = "link", newdata = new_df) +
  labs(y = "Log odds", x = "Milk yield (kg)")
lr_p2 <- m |>
  plot_predictions(by = "milk_yield_kg", type = "response", newdata = new_df) +
  labs(y = "Probability of high SCC", x = "Milk yield (kg)")
lr_p1 + lr_p2
Generalised Linear Models
Generalised Linear Models (GLMs) are a powerful extension to the linear regression model, extending the types of data & conditional distributions that can be modelled beyond the normal or Gaussian distribution of linear regression
Binary (dichotomous) variables can be modelled using logistic regression
Count data can be modelled using Poisson regression
These are two special cases of the broad class of GLMs
Also includes the linear regression as a special case
Generalised Linear Models
Logistic regression is a special case of the GLM where the conditional distribution of the response was assumed to be binomial
GLMs allow the conditional distribution of the response to be any distribution from the exponential family; Poisson, binomial, Gaussian, gamma, multinomial, …
There are three parts to a GLM
The conditional distribution of \(y\)
The linear predictor \(\eta\), and
The link function
Whilst this affords a degree of choice, natural selections for the conditional distribution of \(y\) and the link function often arise from the type of data being modelled
Generalised Linear Models
In a GLM we want a model for the expectation of \(y\), \(\mathsf{E}(y_i)\), which is commonly abbreviated to \(\mu_i\)
We might model \(\mu_i\) as following a Poisson distribution if the data were counts, or a binomial distribution in the case of dichotomous data, as we did in the high SCC example
We need to decide which predictor variables and any transformations of them should be used to predict \(y\); this is the linear predictor, \(\eta\)
Finally we need to map the values from the response scale on to a linear scale, just as we mapped from probabilities to log-odds using the logit transformation. This is the link function, \(g()\)
GLM: Link Function
The link function maps from the response scale to the linear scale
\[g(\mu_i) = \eta_i\]
To map from the linear scale back to the response scale we apply the inverse of \(g\), \(g^{-1}\):
\[\mu_i = g^{-1}(\eta_i)\]
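For the logit link these two functions are available in base R as `qlogis()` (the link, \(g\)) and `plogis()` (its inverse, \(g^{-1}\)):

```r
# The logit link and its inverse in base R
mu  <- 0.3
eta <- qlogis(mu)     # g(mu): response scale -> linear scale (log-odds)
mu2 <- plogis(eta)    # g^{-1}(eta): linear scale -> response scale
```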
GLM: Link Function
This sounds complicated, but look again at the logistic regression
The logit transformation linearises the relationship & its inverse maps back from the linear scale to the non-linear response scale
Other link functions are suitable for other types of GLM; in a Poisson GLM, as we’ll see next, it is common to use the log link function
Poisson GLM
Often we’ll encounter count data of some sort, e.g. the number of cases of a disease in an epidemic
Counts are strictly non-negative; can’t have a count of less than 0
Variance increases as a function of the mean, \(\lambda\)
Poisson GLM
The Poisson distribution is a standard distribution for modelling count data
The Poisson gives the distribution of the number of things (individuals, counts, events) in a given sampling interval/effort if each event is independent
The Poisson is a simple distribution; it is defined by a single parameter \(\lambda\), which is both the mean and the variance
The standard link function for a Poisson GLM is the log link; maps non-negative counts on to a linear scale
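A minimal sketch of fitting a Poisson GLM in R on simulated counts (the data and coefficients below are made up):

```r
# Sketch: Poisson GLM with the (default) log link on simulated counts
set.seed(42)
n <- 300
x <- runif(n, 0, 10)
counts <- rpois(n, lambda = exp(0.2 + 0.15 * x))   # log-linear mean
pm <- glm(counts ~ x, family = poisson)            # log link is the default
exp(coef(pm))   # multiplicative effect on the expected count per unit of x
```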
Mayekar & Kodandaramaiah (2017; PLoS One, 12, e0171482) studied determinants of pupal colour (green vs. brown) of a tropical butterfly
Response
Green vs Brown
Predictors
time to pupation
pupal weight
sex
Discuss in your groups:
what type of data is each of the variables?
what kind of GLM is appropriate?
Problematic grizzly bears
Morehouse et al (2016; PLoS One, 11, e0165425) used genetic analyses to determine the parentage of bear cubs, and cross-classified cubs and mothers by whether they caused problems around humans
Problematic grizzly bears
Table shows the number of cubs and mothers cross-classified by whether they are problematic or not

|                | Mother: Yes | Mother: No | Total |
|----------------|------------:|-----------:|------:|
| Offspring: Yes |           5 |         18 |    23 |
| Offspring: No  |           3 |         50 |    53 |
| Total          |           8 |         68 |    76 |
Discuss in your groups:
what type of data is each of the variables?
what kind of GLM is appropriate?
Cat Owner’s Attitudes
Teng et al (2020; PLoS One, 15, e0234190) surveyed cat owners in Australia on factors that might affect the prevalence of overweight and obese cats